Description

Background of the Invention 

[0001] The invention relates to computer speech recognition, particularly to the recognition of spoken computer
commands. When a spoken command is recognized, the computer performs one or more functions associated with the
command.

[0002] In general, a speech recognition apparatus consists of an acoustic processor and a stored set of acoustic
models. The acoustic processor measures sound features of an utterance. Each acoustic model represents the acoustic
features of an utterance of one or more words associated with the model. The sound features of the utterance are
compared to each acoustic model to produce a match score. The match score for an utterance and an acoustic model
is an estimate of the closeness of the sound features of the utterance to the acoustic model.

[0003] The word or words associated with the acoustic model having the best match score may be selected as the
recognition result. Alternatively, the acoustic match score may be combined with other match scores, such as additional
acoustic match scores and language model match scores. The word or words associated with the acoustic model or
models having the best combined match score may be selected as the recognition result.

[0004] For command and control applications, the speech recognition apparatus preferably recognizes an uttered
command, and the computer system then immediately executes the command to perform a function associated with the
recognized command. For this purpose, the command associated with the acoustic model having the best match score
may be selected as the recognition result.

[0005] A serious problem with such systems, however, is that inadvertent sounds such as coughs, sighs, or spoken
words not intended for recognition can be misrecognized as valid commands. The computer system then immediately
executes the misrecognized commands to perform the associated functions, with unintended consequences.
[0006] US-A-4,239,936 discloses a speech recognition system in which ambient noise intensity is measured in
parallel to the input speech signals, with any recognition result assigned to the input speech signal being rejected when the
intensity of the noise exceeds a predetermined standard value.

Summary of the Invention 

[0007] It is an object of the invention to provide a speech recognition apparatus and method which has a high
likelihood of rejecting acoustic matches to inadvertent sounds or words spoken but not intended for the speech recognizer.
[0008] It is another object of the invention to provide a speech recognition apparatus and method which identifies
the acoustic model which is best matched to a sound, and which has a high likelihood of rejecting the best matched
acoustic model if the sound is inadvertent or not intended for the speech recognizer, but which has a high likelihood of
accepting the best matched acoustic model if the sound is a word or words intended for recognition.

[0009] A speech recognition apparatus according to the invention comprises an acoustic processor for measuring
the value of at least one feature of each of a sequence of at least two sounds. The acoustic processor measures the
value of the feature of each sound during each of a series of successive time intervals to produce a series of feature
signals representing the feature values of the sound. Means are also provided for storing a set of acoustic command
models. Each acoustic command model represents one or more series of acoustic feature values representing an
utterance of a command associated with the acoustic command model.

[0010] A match score processor generates a match score for each sound and each of one or more acoustic command
models from the set of acoustic command models. Each match score comprises an estimate of the closeness of
a match between the acoustic command model and a series of feature signals corresponding to the sound. Means are
provided for outputting a recognition signal corresponding to the command model having the best match score for a
current sound if the best match score for the current sound is better than a recognition threshold score for the current
sound. The recognition threshold for the current sound comprises (a) a first confidence score if the best match score
for a prior sound was better than a recognition threshold for that prior sound, or (b) a second confidence score better
than the first confidence score if the best match score for a prior sound was worse than the recognition threshold for
that prior sound.

[0011] Preferably, the prior sound occurs immediately prior to the current sound.

[0012] A speech recognition apparatus according to the invention may further comprise means for storing at least
one acoustic silence model representing one or more series of acoustic feature values representing the absence of a
spoken utterance. The match score processor also generates a match score for each sound and the acoustic silence
model. Each silence match score comprises an estimate of the closeness of a match between the acoustic silence
model and a series of feature signals corresponding to the sound.

[0013] In this aspect of the invention, the recognition threshold for the current sound comprises the first confidence
score (a1) if the match score for the prior sound and the acoustic silence model is better than a silence match threshold,
and if the prior sound has a duration exceeding a silence duration threshold, or (a2) if the match score for the prior
sound and the acoustic silence model is better than the silence match threshold, and if the prior sound has a duration
less than the silence duration threshold, and if the best match score for the next prior sound and an acoustic command
model was better than a recognition threshold for that next prior sound, or (a3) if the match score for the prior sound
and the acoustic silence model is worse than the silence match threshold, and if the best match score for the prior
sound and an acoustic command model was better than a recognition threshold for that prior sound.
[0014] The recognition threshold for the current sound comprises the second confidence score better than the first
confidence score (b1) if the match score for the prior sound and the acoustic silence model is better than the silence
match threshold, and if the prior sound has a duration less than the silence duration threshold, and if the best match
score for the next prior sound and an acoustic command model was worse than the recognition threshold for that next
prior sound, or (b2) if the match score for the prior sound and the acoustic silence model is worse than the silence
match threshold, and if the best match score for the prior sound and an acoustic command model was worse than the
recognition threshold for that prior sound.

[0015] The recognition signal may be, for example, a command signal for calling a program associated with the
command. In one aspect of the invention, the output means comprises a display, and the output means displays one or
more words corresponding to the command model having the best match score for a current sound if the best match
score for the current sound is better than the recognition threshold score for the current sound.

[0016] In another aspect of the invention, the output means outputs an unrecognizable-sound indication signal if the
best match score for the current sound is worse than the recognition threshold score for the current sound. For example,
the output means may display an unrecognizable-sound indicator if the best match score for the current sound is worse
than the recognition threshold score for the current sound. The unrecognizable-sound indicator may comprise, for
example, one or more question marks.
[0017] The acoustic processor in the speech recognition apparatus according to the invention may comprise a
microphone. Each sound may be, for example, a vocal sound, and each command may comprise at least one word.

[0018] According to a further aspect of the invention, a speech recognition method as defined in claim 11 is provided.

[0019] Thus, according to the invention, acoustic match scores generally fall into three categories. When the best
match score is better than a "good" confidence score, the word or words corresponding to the acoustic model having
the best match score almost always correspond to the measured sounds. On the other hand, when the best match
score is worse than a "poor" confidence score, the word corresponding to the acoustic model having the best match
score almost never corresponds to the measured sounds. When the best match score is better than the "poor"
confidence score but is worse than the "good" confidence score, the word corresponding to the acoustic model having the
best match score has a high likelihood of corresponding to the measured sound when the previously recognized word
was accepted as having a high likelihood of corresponding to the previous sound. When the best match score is better
than the "poor" confidence score but is worse than the "good" confidence score, the word corresponding to the acoustic
model having the best match score has a low likelihood of corresponding to the measured sound when the previously
recognized word was rejected as having a low likelihood of corresponding to the previous sound. However, if there is
sufficient intervening silence between a previously rejected word and the current word having the best match score better
than the "poor" confidence score but worse than the "good" confidence score, then the current word is also accepted
as having a high likelihood of corresponding to the measured current sound.

[0020] By adopting the confidence scores according to the invention, a speech recognition apparatus and method
has a high likelihood of rejecting acoustic matches to inadvertent sounds or words spoken but not intended for the
speech recognizer. That is, by adopting the confidence scores according to the invention, a speech recognition
apparatus and method which identifies the acoustic model which is best matched to a sound has a high likelihood of rejecting
the best matched acoustic model if the sound is inadvertent or not intended for the speech recognizer, and has a high
likelihood of accepting the best matched acoustic model if the sound is a word or words intended for the speech
recognizer.

Brief Description of the Drawing
[0021] 

Figure 1 is a block diagram of an example of a speech recognition apparatus according to the invention.

Figure 2 schematically shows an example of an acoustic command model. 
Figure 3 schematically shows an example of an acoustic silence model.

Figure 4 schematically shows an example of the acoustic silence model of Figure 3 concatenated onto the end of
the acoustic command model of Figure 2.

Figure 5 schematically shows the states and possible transitions between states for the combined acoustic model 
of Figure 4 at each of a number of times t. 

Figure 6 is a block diagram of an example of the acoustic processor of Figure 1.

Description of the Preferred Embodiments

[0022] Referring to Figure 1, the speech recognition apparatus according to the invention comprises an acoustic 
processor 10 for measuring the value of at least one feature of each of a sequence of at least two sounds. The acoustic
processor 10 measures the value of the feature of each sound during each of a series of successive time intervals to 
produce a series of feature signals representing the feature values of the sound. 

[0023] As described in more detail below, the acoustic processor may, for example, measure the amplitude of each
sound in one or more frequency bands during each of a series of ten-millisecond time intervals to produce a series of 
feature vector signals representing the amplitude values of the sound. If desired, the feature vector signals may be 
quantized by replacing each feature vector signal with a prototype vector signal, from a set of prototype vector signals, 
which is best matched to the feature vector signal. Each prototype vector signal has a label identifier, and so in this case 
the acoustic processor produces a series of label signals representing the feature values of the sound. 
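
The quantization step described above can be sketched as follows. This is an illustrative sketch only: the function name `quantize`, the label names, and the use of Euclidean distance as the meaning of "best matched" are assumptions, since the text does not fix a distance measure.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def quantize(feature_vectors, prototypes):
    """Replace each feature vector with the label of its nearest prototype.

    prototypes maps a label identifier to a prototype vector.
    Assumption: "best matched" means smallest Euclidean distance.
    """
    labels = []
    for vec in feature_vectors:
        best_label = min(prototypes, key=lambda lab: dist(vec, prototypes[lab]))
        labels.append(best_label)
    return labels


# Hypothetical two-dimensional prototypes and feature vectors.
prototypes = {"L1": (0.0, 0.0), "L2": (1.0, 1.0)}
print(quantize([(0.1, 0.2), (0.9, 0.8)], prototypes))  # ['L1', 'L2']
```

The output is a series of label identifiers, one per ten-millisecond interval, which is the form of label signal the paragraph describes.
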
[0024] The speech recognition apparatus further comprises an acoustic command models store 12 for storing a set 
of acoustic command models. Each acoustic command model represents one or more series of acoustic feature values 
representing an utterance of a command associated with the acoustic command model. 

[0025] The stored acoustic command models may be, for example, Markov models or other dynamic programming 
models. The parameters of the acoustic command models may be estimated from a known uttered training text by, for 
example, smoothing parameters obtained by the forward-backward algorithm. (See, for example, F. Jelinek, "Continuous
Speech Recognition By Statistical Methods." Proceedings of the IEEE, Vol. 64, No. 4, April 1976, pages 532-556.)
[0026] Preferably, each acoustic command model represents a command spoken in isolation (that is, independent 
of the context of prior and subsequent utterances). Context-independent acoustic command models can be produced, 
for example, either manually from models of phonemes, or automatically, for example, by the method described by Lalit 
R. Bahl et al in U.S. Patent 4,759,068 entitled "Constructing Markov Models of Words From Multiple Utterances", or by 
any other known method of generating context-independent models. 

[0027] Alternatively, context-dependent models may be produced from context-independent models by grouping 
utterances of a command into context-dependent categories. A context can be, for example, manually selected, or
automatically selected by tagging each feature signal corresponding to a command with its context, and by grouping the
feature signals according to their context to optimize a selected evaluation function. (See, for example, Lalit R. Bahl et al,
"Apparatus and Method of Grouping Utterances of a Phoneme into Context-Dependent Categories Based on
Sound-Similarity for Automatic Speech Recognition." U.S. Patent 5,195,167.)

[0028] Figure 2 schematically shows an example of a hypothetical acoustic command model. In this example the 
acoustic command model comprises four states S1, S2, S3, and S4 illustrated in Figure 2 as dots. The model starts at
the initial state S1 and terminates at the final state S4. The dashed null transitions correspond to no acoustic feature 
signal output by the acoustic processor 10. To each solid line transition, there corresponds an output probability distri- 
bution over either feature vector signals or label signals produced by the acoustic processor 10. For each state of the 
model, there corresponds a probability distribution over the transitions out of that state. 

[0029] Returning to Figure 1, the speech recognition apparatus further comprises a match score processor 14 for
generating a match score for each sound and each of one or more acoustic command models from the set of acoustic
command models in acoustic command models store 12. Each match score comprises an estimate of the closeness of
a match between the acoustic command model and a series of feature signals from acoustic processor 10
corresponding to the sound.

[0030] A recognition threshold comparator and output 16 outputs a recognition signal corresponding to the com- 
mand model from acoustic command models store 12 having the best match score for a current sound if the best match 
score for the current sound is better than a recognition threshold score for the current sound. The recognition threshold 
for the current sound comprises a first confidence score from confidence scores store 18 if the best match score for a 
prior sound was better than a recognition threshold for that prior sound. The recognition threshold for the current sound 
comprises a second confidence score from confidence scores store 18, better than the first confidence score, if the best 
match score for a prior sound was worse than the recognition threshold for that prior sound. 
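
The threshold switching described above can be sketched as a small loop. This is a hedged illustration, not the apparatus itself: the function name `recognize` and the numeric scores are hypothetical, and the sketch assumes that a higher score is a better score and that the threshold starts at the first confidence score.

```python
def recognize(best_scores, first_confidence, second_confidence):
    """Return (score, accepted) pairs for a sequence of best match scores.

    Assumptions: higher scores are better, second_confidence is greater
    than first_confidence, and the initial threshold is first_confidence.
    """
    threshold = first_confidence
    results = []
    for score in best_scores:
        accepted = score > threshold
        results.append((score, accepted))
        # Prior sound accepted: lenient threshold. Rejected: strict threshold.
        threshold = first_confidence if accepted else second_confidence
    return results


# Hypothetical scores: a mid-range 0.5 is accepted after an accepted sound
# but rejected after a rejected one.
for score, accepted in recognize([0.9, 0.5, 0.2, 0.5, 0.9, 0.5], 0.3, 0.7):
    print(score, "accepted" if accepted else "rejected")
```

In this run the same score of 0.5 is accepted in the second and sixth positions and rejected in the fourth, which is the hysteresis the paragraph describes.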

[0031] The speech recognition apparatus may further comprise an acoustic silence model store 20 for storing at
least one acoustic silence model representing one or more series of acoustic feature values representing the absence
of a spoken utterance. The acoustic silence model may be, for example, a Markov model or other dynamic programming
model. The parameters of the acoustic silence model may be estimated from a known uttered training text by, for
example, smoothing parameters obtained by the forward-backward algorithm, in the same manner as for the acoustic
command models.

EP 0 625 775 B1

[0032] Figure 3 schematically shows an example of an acoustic silence model. The model starts in the initial state
S4 and terminates in the final state S10. The dashed null transitions correspond to no acoustic feature signal output. To
each solid line transition there corresponds an output probability distribution over the feature signals (for example,
feature vector signals or label signals) produced by the acoustic processor 10. For each state S4 through S10, there
corresponds a probability distribution over the transitions out of that state.
[0033] Returning to Figure 1, the match score processor 14 generates a match score for each sound and the
acoustic silence model in acoustic silence model store 20. Each match score with the acoustic silence model comprises 
an estimate of the closeness of a match between the acoustic silence model and a series of feature signals correspond- 
ing to the sound. 

[0034] In this variation of the invention, the recognition threshold utilized by recognition threshold comparator and 
output 16 comprises the first confidence score if the match score for the prior sound and the acoustic silence model is 
better than a silence match threshold obtained from silence match and duration thresholds store 22, and if the prior 
sound has a duration exceeding a silence duration threshold stored in silence match and duration thresholds store 22.
Alternatively, the recognition threshold for the current sound comprises the first confidence score if the match score for 
the prior sound and the acoustic silence model is better than the silence match threshold, and if the prior sound has a 
duration less than the silence duration threshold, and if the best match score for the next prior sound and an acoustic 
command model was better than a recognition threshold for that next prior sound. Finally, the recognition threshold for 
the current sound comprises the first confidence score if the match score for the prior sound and the acoustic silence 
model is worse than the silence match threshold, and if the best match score for the prior sound and an acoustic com- 
mand model was better than a recognition threshold for that prior sound. 

[0035] In this embodiment of the invention, the recognition threshold for the current sound comprises the second
confidence score better than the first confidence score from confidence scores store 18 if the match score from match
score processor 14 for the prior sound and the acoustic silence model is better than the silence match threshold, and if
the prior sound has a duration less than the silence duration threshold, and if the best match score for the next prior
sound and an acoustic command model was worse than the recognition threshold for that next prior sound.
Alternatively, the recognition threshold for the current sound comprises the second confidence score better than the first
confidence score if the match score for the prior sound and the acoustic silence model is worse than the silence match
threshold, and if the best match score for the prior sound and an acoustic command model was worse than the
recognition threshold for that prior sound.
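
The five cases above reduce to a short decision function. The following is a sketch under stated assumptions: the function and parameter names are hypothetical, durations are in centiseconds, and a silence whose duration exactly equals the threshold is treated as short, a boundary case the text does not specify.

```python
def recognition_threshold(prior_is_silence, prior_duration_cs,
                          prior_accepted, next_prior_accepted,
                          first_confidence, second_confidence,
                          silence_duration_threshold_cs=25):
    """Choose the recognition threshold for the current sound.

    prior_is_silence: the prior sound's silence match score was better
        than the silence match threshold.
    prior_accepted / next_prior_accepted: whether the best command match
        score of that sound was better than its recognition threshold.
    """
    if prior_is_silence:
        if prior_duration_cs > silence_duration_threshold_cs:
            return first_confidence      # long silence: lenient threshold
        accepted = next_prior_accepted   # short silence: look one sound back
    else:
        accepted = prior_accepted        # prior sound was not silence
    return first_confidence if accepted else second_confidence


# A rejected sound followed by a long silence resets to the lenient threshold.
print(recognition_threshold(True, 40, False, False, 0.3, 0.7))  # 0.3
# A rejected sound followed by only a short silence keeps the strict threshold.
print(recognition_threshold(True, 10, False, False, 0.3, 0.7))  # 0.7
```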

[0036] In order to generate a match score for each sound and each of one or more acoustic command models from 
the set of acoustic command models in acoustic command models store 12, and in order to generate a match score for 
each sound and the acoustic silence model in acoustic silence model store 20, the acoustic silence model of Figure 3 
may be concatenated onto the end of the acoustic command model of Figure 2, as shown in Figure 4. The combined 
model starts in the initial state S1, and terminates in the final state S10. 

[0037] The states S1 through S10 and the allowable transitions between the states for the combined acoustic 
model of Figure 4 at each of a number of times t are schematically shown in Figure 5. For each time interval between 
$t = n-1$ and $t = n$, the acoustic processor produces a feature signal $X_n$.

[0038] For each state of the combined model shown in Figure 4, the conditional probability $P(s_t = S_\sigma \mid X_1 \ldots X_t)$ that
state $s_t$ equals state $S_\sigma$ at time t given the occurrence of feature signals $X_1$ through $X_t$ produced by the acoustic
processor 10 at times 1 through t, respectively, is obtained by Equations 1 through 10.

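$$P(s_t = S1 \mid X_1 \ldots X_t) = [P(s_{t-1} = S1)\, P(s_t = S1 \mid s_{t-1} = S1)\, P(X_t \mid s_t = S1, s_{t-1} = S1)] \tag{1}$$

$$\begin{aligned}
P(s_t = S2 \mid X_1 \ldots X_t) = {} & [P(s_{t-1} = S1)\, P(s_t = S2 \mid s_{t-1} = S1)\, P(X_t \mid s_t = S2, s_{t-1} = S1)] \\
& + P(s_t = S1)\, P(s_t = S2 \mid s_t = S1) \\
& + [P(s_{t-1} = S2)\, P(s_t = S2 \mid s_{t-1} = S2)\, P(X_t \mid s_t = S2, s_{t-1} = S2)]
\end{aligned} \tag{2}$$

$$\begin{aligned}
P(s_t = S3 \mid X_1 \ldots X_t) = {} & [P(s_{t-1} = S2)\, P(s_t = S3 \mid s_{t-1} = S2)\, P(X_t \mid s_t = S3, s_{t-1} = S2)] \\
& + P(s_t = S2)\, P(s_t = S3 \mid s_t = S2) \\
& + [P(s_{t-1} = S3)\, P(s_t = S3 \mid s_{t-1} = S3)\, P(X_t \mid s_t = S3, s_{t-1} = S3)]
\end{aligned} \tag{3}$$

$$\begin{aligned}
P(s_t = S4 \mid X_1 \ldots X_t) = {} & [P(s_{t-1} = S3)\, P(s_t = S4 \mid s_{t-1} = S3)\, P(X_t \mid s_t = S4, s_{t-1} = S3)] \\
& + P(s_t = S3)\, P(s_t = S4 \mid s_t = S3)
\end{aligned} \tag{4}$$

$$\begin{aligned}
P(s_t = S5 \mid X_1 \ldots X_t) = {} & [P(s_{t-1} = S4)\, P(s_t = S5 \mid s_{t-1} = S4)\, P(X_t \mid s_t = S5, s_{t-1} = S4)] \\
& + [P(s_{t-1} = S5)\, P(s_t = S5 \mid s_{t-1} = S5)\, P(X_t \mid s_t = S5, s_{t-1} = S5)]
\end{aligned} \tag{5}$$

$$\begin{aligned}
P(s_t = S6 \mid X_1 \ldots X_t) = {} & [P(s_{t-1} = S5)\, P(s_t = S6 \mid s_{t-1} = S5)\, P(X_t \mid s_t = S6, s_{t-1} = S5)] \\
& + [P(s_{t-1} = S6)\, P(s_t = S6 \mid s_{t-1} = S6)\, P(X_t \mid s_t = S6, s_{t-1} = S6)]
\end{aligned} \tag{6}$$

$$\begin{aligned}
P(s_t = S7 \mid X_1 \ldots X_t) = {} & [P(s_{t-1} = S6)\, P(s_t = S7 \mid s_{t-1} = S6)\, P(X_t \mid s_t = S7, s_{t-1} = S6)] \\
& + [P(s_{t-1} = S7)\, P(s_t = S7 \mid s_{t-1} = S7)\, P(X_t \mid s_t = S7, s_{t-1} = S7)]
\end{aligned} \tag{7}$$

$$P(s_t = S8 \mid X_1 \ldots X_t) = [P(s_{t-1} = S4)\, P(s_t = S8 \mid s_{t-1} = S4)\, P(X_t \mid s_t = S8, s_{t-1} = S4)] \tag{8}$$

$$P(s_t = S9 \mid X_1 \ldots X_t) = [P(s_{t-1} = S8)\, P(s_t = S9 \mid s_{t-1} = S8)\, P(X_t \mid s_t = S9, s_{t-1} = S8)] \tag{9}$$

$$\begin{aligned}
P(s_t = S10 \mid X_1 \ldots X_t) = {} & P(s_t = S4)\, P(s_t = S10 \mid s_t = S4) \\
& + P(s_t = S8)\, P(s_t = S10 \mid s_t = S8) \\
& + P(s_t = S9)\, P(s_t = S10 \mid s_t = S9) \\
& + [P(s_{t-1} = S7)\, P(s_t = S10 \mid s_{t-1} = S7)\, P(X_t \mid s_t = S10, s_{t-1} = S7)] \\
& + [P(s_{t-1} = S9)\, P(s_t = S10 \mid s_{t-1} = S9)\, P(X_t \mid s_t = S10, s_{t-1} = S9)]
\end{aligned} \tag{10}$$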

[0039] In order to normalize the conditional state probabilities to account for the different numbers of feature signals
($X_1 \ldots X_t$) at different times t, a normalized state output score Q for a state $\sigma$ at time t can be given by Equation 11.

$$Q(\sigma, t) = \frac{P(s_t = S_\sigma \mid X_1 \ldots X_t)}{\prod_{i=1}^{t} P(X_i)} \tag{11}$$



[0040] Estimated values for the conditional probabilities $P(s_t = S_\sigma \mid X_1 \ldots X_t)$ of the states (in this example, states S1
through S10) can be obtained from Equations 1 through 10 by using the values of the transition probability parameters
and the output probability parameters of the acoustic command models and acoustic silence model.
[0041] Estimated values for the normalized state output score Q can be obtained from Equation 11 by estimating
the probability $P(X_i)$ of each observed feature signal $X_i$ as the product of the conditional probability $P(X_i \mid X_{i-1})$ of feature
signal $X_i$ given the immediately prior occurrence of feature signal $X_{i-1}$, multiplied by the probability $P(X_{i-1})$ of occurrence
of the feature signal $X_{i-1}$. The value of $P(X_i \mid X_{i-1})\, P(X_{i-1})$ for all feature signals $X_i$ and $X_{i-1}$ may be estimated by counting
the occurrences of feature signals generated from a training text according to Equation 12.

$$P(X_i \mid X_{i-1})\, P(X_{i-1}) = \frac{N(X_i, X_{i-1})}{N(X_{i-1})} \cdot \frac{N(X_{i-1})}{N} = \frac{N(X_i, X_{i-1})}{N} \tag{12}$$



[0042] In Equation 12, $N(X_i, X_{i-1})$ is the number of occurrences of the feature signal $X_i$ immediately preceded by
the feature signal $X_{i-1}$ generated by the utterance of the training script, and N is the total number of feature signals
generated by the utterance of the training script.
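
The counting estimate of Equation 12 can be illustrated with a few lines of code. This is a minimal sketch: the function name `joint_estimates` and the toy label sequence are hypothetical, and the feature signals are assumed to be quantized label signals so that they can be counted.

```python
from collections import Counter


def joint_estimates(labels):
    """Estimate P(X_i | X_{i-1}) P(X_{i-1}) as N(X_i, X_{i-1}) / N.

    labels: feature (label) signals generated by uttering a training script.
    Returns a dict keyed by (X_i, X_{i-1}).
    """
    n = len(labels)  # N, the total number of feature signals
    pair_counts = Counter(zip(labels[1:], labels[:-1]))  # N(X_i, X_{i-1})
    return {pair: count / n for pair, count in pair_counts.items()}


# Hypothetical training sequence of five label signals.
estimates = joint_estimates(["a", "b", "a", "b", "b"])
print(estimates[("b", "a")])  # "b" preceded by "a" occurs twice out of N = 5
```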

[0043] From Equation 11, above, normalized state output scores Q(S4, t) and Q(S10, t) can be obtained for states
S4 and S10 of the combined model of Figure 4. State S4 is the last state of the command model and is the first state
of the silence model. State S10 is the last state of the silence model.

[0044] In one example of the invention, a match score for a sound and the acoustic silence model at time t may be 
given by the ratio of the normalized state output score Q[S10,t] for state S10 divided by the normalized state output 
score Q[S4,t] for state S4 as shown in Equation 13. 



$$\text{Silence Start Match Score} = \frac{Q[S10, t]}{Q[S4, t]} \tag{13}$$



[0045] The time $t = t_{start}$ at which the match score for the sound and the acoustic silence model (Equation 13) first
exceeds a silence match threshold may be considered to be the beginning of an interval of silence. The silence match
threshold is a tuning parameter which may be adjusted by the user. A silence match threshold of $10^{15}$ has been found
to produce good results.

[0046] The end of the interval of silence may, for example, be determined by evaluating the ratio of the normalized
state output score Q[S10,t] for state S10 at time t divided by the maximum value obtained for the normalized state
output score $Q_{max}[S10, t_{start} \ldots t]$ for state S10 over time intervals $t_{start}$ through t.

$$\text{Silence End Match Score} = \frac{Q[S10, t]}{Q_{max}[S10, t_{start} \ldots t]} \tag{14}$$



[0047] The time $t = t_{end}$ at which the value of the silence end match score of Equation 14 first falls below the value
of a silence end threshold may be considered to be the end of the interval of silence. The value of the silence end
threshold is a tuning parameter which can be adjusted by the user. A value of $10^{-25}$ has been found to provide good
results.

[0048] If the match score for the sound and the acoustic silence model as given by Equation 13 is better than the
silence match threshold, then the silence is considered to have started the first time $t_{start}$ at which the ratio of Equation
13 exceeded the silence match threshold. The silence is considered to have ended at the first time $t_{end}$ at which the
ratio of Equation 14 is less than the associated tuning parameter. The duration of the silence is then ($t_{end} - t_{start}$).
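
The start and end tests of Equations 13 and 14 can be sketched as a single pass over the two score series. This is an illustrative sketch: the function name `find_silence` and the sample values are hypothetical, frame indices stand in for times t, and the example call uses small toy thresholds rather than the tuned values quoted in the text.

```python
def find_silence(q_s10, q_s4, start_threshold=1e15, end_threshold=1e-25):
    """Return (t_start, t_end) frame indices of the first silence interval.

    q_s10[t] and q_s4[t] are the normalized state output scores Q[S10,t]
    and Q[S4,t]. Returns None if no silence starts, and t_end is None if
    the silence has not ended within the data.
    """
    t_start = None
    q_max = 0.0
    for t, (q10, q4) in enumerate(zip(q_s10, q_s4)):
        if t_start is None:
            # Equation 13: silence start match score.
            if q4 > 0 and q10 / q4 > start_threshold:
                t_start = t
                q_max = q10
            continue
        q_max = max(q_max, q10)
        # Equation 14: silence end match score.
        if q10 / q_max < end_threshold:
            return t_start, t
    return (t_start, None) if t_start is not None else None


# Toy scores with toy thresholds (assumed values, for illustration only).
interval = find_silence([1, 50, 60, 40, 2], [10, 1, 1, 1, 1],
                        start_threshold=10, end_threshold=0.1)
print(interval)  # (1, 4): the silence lasts 4 - 1 = 3 frames
```

With ten-millisecond frames, the frame difference is the silence duration in centiseconds.
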
[0049] For the purpose of deciding whether the recognition threshold should be the first confidence score or the
second confidence score, the silence duration threshold stored in silence match and duration thresholds store 22 is a 
tuning parameter which is adjustable by the user. A silence duration threshold of, for example, 25 centiseconds has 
been found to provide good results. 

[0050] The match score for each sound and an acoustic command model corresponding to states S1 through S4
of Figures 2 and 4 may be obtained as follows. If the ratio of Equation 13 does not exceed the silence match threshold
prior to the time $t_{end}$, the match score for each sound and the acoustic command model corresponding to states S1
through S4 of Figures 2 and 4 may be given by the maximum normalized state output score $Q_{max}[S10, t'_{end} \ldots t_{end}]$ for
state S10 over time intervals $t'_{end}$ through $t_{end}$, where $t'_{end}$ is the end of the preceding sound or silence, and where $t_{end}$
is the end of the current sound or silence. Alternatively, the match score for each sound and the acoustic command
model may be given by the sum of the normalized state output scores Q[S10,t] for state S10 over time intervals $t'_{end}$
through $t_{end}$. However, if the ratio of Equation 13 exceeds the silence match threshold prior to the time $t_{end}$, then the
match score for the sound and the acoustic command model may be given by the normalized state output score
$Q[S4, t_{start}]$ for state S4 at time $t_{start}$. Alternatively, the match score for each sound and the acoustic command model may be
given by the sum of the normalized state output scores Q[S4,t] for state S4 over time intervals $t'_{end}$ through $t_{start}$.
[0051] The first confidence score and the second confidence score for the recognition threshold are tuning param- 
eters which may be adjusted by the user. The first and second confidence scores may be generated, for example, as 
follows. 

[0052] A training script comprising in-vocabulary command words represented by stored acoustic command mod- 
els, and also comprising out-of-vocabulary words which are not represented by stored acoustic command models is 
uttered by one or more speakers. Using the speech recognition apparatus according to the invention, but without a rec- 
ognition threshold, a series of recognized words are generated as being best matched to the uttered, known training 
script. Each word or command output by the speech recognition apparatus has an associated match score.
[0053] By comparing the command words in the known training script with the recognized words output by the 
speech recognition apparatus, correctly recognized words and misrecognized words can be identified. The first confi- 
dence score may, for example, be the best match score which is worse than the match scores of 99% to 100% of the 
correctly recognized words. The second confidence score may be, for example, the worst match score which is better 
than the match scores of, for example, 99% to 100% of the misrecognized words in the training script. 
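
One way to read the selection rule above is sketched below. This is an interpretation, not the procedure of the text: the function name `confidence_scores` is hypothetical, higher scores are assumed to be better, a fixed fraction of 99% is used, and candidate thresholds are drawn from the match scores observed on the training script.

```python
def confidence_scores(correct, misrecognized, fraction=0.99):
    """Pick (first, second) confidence scores from training match scores.

    first:  best candidate that is worse than at least `fraction` of the
            correctly recognized scores.
    second: worst candidate that is better than at least `fraction` of the
            misrecognized scores.
    Either value is None if no observed score qualifies.
    """
    candidates = sorted(set(correct) | set(misrecognized))

    def share_better_than(x, scores):
        return sum(s > x for s in scores) / len(scores)

    def share_worse_than(x, scores):
        return sum(s < x for s in scores) / len(scores)

    first = max((x for x in candidates
                 if share_better_than(x, correct) >= fraction), default=None)
    second = min((x for x in candidates
                  if share_worse_than(x, misrecognized) >= fraction), default=None)
    return first, second


# Hypothetical match scores from a training script.
print(confidence_scores([0.6, 0.7, 0.8, 0.9], [0.2, 0.3, 0.4, 0.65]))  # (0.4, 0.7)
```

In this toy data the second confidence score (0.7) is better than the first (0.4), as the recognition threshold logic requires.
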
[0054] The recognition signal which is output by the recognition threshold comparator and output 16 may comprise 
a command signal for calling a program associated with the command. For example, the command signal may simulate 
the manual entry of keystrokes corresponding to a command. Alternatively, the command signal may be an application 
program interface call. 

[0055] The recognition threshold comparator and output 16 may comprise a display, such as a cathode ray tube, a 
liquid crystal display, or a printer. The recognition threshold comparator and output 16 may display one or more words 
corresponding to the command model having the best match score for a current sound if the best match score for the 
current sound is better than the recognition threshold score for the current sound. 

[0056] The output means 16 may optionally output an unrecognizable-sound signal if the best match score for the 
current sound is worse than the recognition threshold score for the current sound. For example, the output 16 may dis- 
play an unrecognizable-sound indicator if the best match score for the current sound is worse than the recognition 
threshold score for the current sound. The unrecognizable-sound indicator may comprise one or more displayed ques- 
tion marks. 

[0057] Each sound measured by the acoustic processor 10 may be a vocal sound or some other sound. Each
command associated with an acoustic command model preferably comprises at least one word.

[0058] At the beginning of a speech recognition session, the recognition threshold may be initialized at either the 
first confidence score or the second confidence score. Preferably, however, the recognition threshold for the current 
sound is initialized at the first confidence score at the beginning of a speech recognition session. 
[0059] The speech recognition apparatus according to the present invention may be used with any existing speech 
recognizer, such as the IBM Speech Server Series (trademark) product. The match score processor 14 and the recog- 
nition threshold comparator and output 16 may be, for example, suitably programmed special purpose or general pur- 
pose digital processors. The acoustic command models store 12, the confidence scores store 18, the acoustic silence 
model store 20, and the silence match and duration thresholds store 22 may comprise, for example, electronic readable 
computer memory. 

[0060] One example of the acoustic processor 10 of Figure 3 is shown in Figure 6. The acoustic processor com- 
prises a microphone 24 for generating an analog electrical signal corresponding to the utterance. The analog electrical 
signal from microphone 24 is converted to a digital electrical signal by analog to digital converter 26. For this purpose, 
the analog signal may be sampled, for example, at a rate of twenty kilohertz by the analog to digital converter 26. 
[0061] A window generator 28 obtains, for example, a twenty millisecond duration sample of the digital signal from 
analog to digital converter 26 every ten milliseconds (one centisecond). Each twenty millisecond sample of the digital 
signal is analyzed by spectrum analyzer 30 in order to obtain the amplitude of the digital signal sample in each of, for 
example, twenty frequency bands. Preferably, spectrum analyzer 30 also generates a twenty-first dimension signal rep- 
resenting the total amplitude or total power of the twenty millisecond digital signal sample. The spectrum analyzer 30 
may be, for example, a fast Fourier transform processor. Alternatively, it may be a bank of twenty band pass filters. 
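
The windowing and spectrum analysis steps might be sketched as follows. This is a minimal pure-Python illustration, not the apparatus itself: the function names are hypothetical, the twenty bands are assumed to be equal-width groups of discrete Fourier transform bins, and a naive transform stands in for a fast Fourier transform processor or a filter bank.

```python
import cmath
import math
from typing import List, Sequence


def frames(signal: Sequence[float], rate_hz: int = 20000,
           window_ms: int = 20, hop_ms: int = 10) -> List[Sequence[float]]:
    """Cut the signal into overlapping windows (20 ms every 10 ms by default)."""
    size = rate_hz * window_ms // 1000
    hop = rate_hz * hop_ms // 1000
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def band_features(frame: Sequence[float], bands: int = 20) -> List[float]:
    """Return `bands` band amplitudes followed by one total-power value.

    Assumption: equal-width bands over the positive-frequency DFT bins, with
    the band amplitude taken as the sum of the bin magnitudes in that band.
    """
    n = len(frame)
    half = n // 2
    mags = [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(half)]
    width = half / bands
    feats = [sum(mags[int(b * width):int((b + 1) * width)]) for b in range(bands)]
    feats.append(sum(x * x for x in frame))  # twenty-first dimension: total power
    return feats
```

At a twenty kilohertz sampling rate each window holds 400 samples and successive windows overlap by 200 samples.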
[0062] The twenty-one dimension vector signals produced by spectrum analyzer 30 may be adapted to remove 
background noise by an adaptive noise cancellation processor 32. Noise cancellation processor 32 subtracts a noise 
vector N(t) from the feature vector F(t) input into the noise cancellation processor to produce an output feature vector 
F'(t). The noise cancellation processor 32 adapts to changing noise levels by periodically updating the noise vector N(t) 
whenever the prior feature vector F(t-1) is identified as noise or silence. The noise vector N(t) is updated according to
the formula

N(t) = N(t-1) + k[F(t-1) - Fp(t-1)]

where N(t) is the noise vector at time t, N(t-1) is the noise vector at time (t-1), k is a fixed parameter of the adaptive noise
cancellation model, F(t-1) is the feature vector input into the noise cancellation processor 32 at time (t-1) and which
represents noise or silence, and Fp(t-1) is one silence or noise prototype vector, from store 34, closest to feature vector
F(t-1).
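
The noise update N(t) = N(t-1) + k[F(t-1) - Fp(t-1)] and the subtraction F'(t) = F(t) - N(t) can be written out as a small sketch. The function names and the default value of `k` are assumptions made for illustration only.

```python
from typing import List, Sequence


def update_noise(noise_prev: Sequence[float], feature_prev: Sequence[float],
                 prototype_prev: Sequence[float], k: float = 0.1) -> List[float]:
    """N(t) = N(t-1) + k * [F(t-1) - Fp(t-1)], applied per component.

    Call this only when F(t-1) was identified as noise or silence.
    The default k is a hypothetical value.
    """
    return [n + k * (f - p)
            for n, f, p in zip(noise_prev, feature_prev, prototype_prev)]


def cancel_noise(feature: Sequence[float], noise: Sequence[float]) -> List[float]:
    """F'(t) = F(t) - N(t), applied per component."""
    return [f - n for f, n in zip(feature, noise)]
```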

[0063] The prior feature vector F(t-1) is recognized as noise or silence if either (a) the total energy of the vector is
below a threshold, or (b) the closest prototype vector in adaptation prototype vector store 36 to the feature vector is a 
prototype representing noise or silence. For the purpose of the analysis of the total energy of the feature vector, the 
threshold may be, for example, the fifth percentile of all feature vectors (corresponding to both speech and silence) pro- 
duced in the two seconds prior to the feature vector being evaluated. 
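
The two-part noise or silence test can be sketched as below. The sketch assumes a nearest-rank percentile and a non-empty history of total energies (about 200 values for two seconds at one vector per centisecond); the function names are hypothetical.

```python
from typing import Sequence


def energy_threshold(recent_energies: Sequence[float], percentile: int = 5) -> float:
    """Nearest-rank percentile of the recent total energies (assumed method)."""
    ordered = sorted(recent_energies)
    rank = max(0, (percentile * len(ordered) + 99) // 100 - 1)
    return ordered[rank]


def is_noise_or_silence(total_energy: float, recent_energies: Sequence[float],
                        closest_prototype_is_noise: bool) -> bool:
    """Condition (a): energy below the threshold; condition (b): the closest
    adaptation prototype represents noise or silence. Either one suffices."""
    return (total_energy < energy_threshold(recent_energies)
            or closest_prototype_is_noise)
```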

[0064] After noise cancellation, the feature vector F'(t) is normalized to adjust for variations in the loudness of the
input speech by short term mean normalization processor 38. Normalization processor 38 normalizes the twenty-one
dimension feature vector F'(t) to produce a twenty dimension normalized feature vector X(t). The twenty-first
dimension of the feature vector F'(t), representing the total amplitude or total power, is discarded. Each component i of
the normalized feature vector X(t) at time t may, for example, be given by the equation

X_i(t) = F'_i(t) - Z(t) [16]

in the logarithmic domain, where F'_i(t) is the i-th component of the unnormalized vector at time t, and where Z(t) is a
weighted mean of the components of F'(t) and Z(t-1) according to Equations 17 and 18:

Z(t) = 0.9 Z(t-1) + 0.1 M(t) [17]

and where

M(t) = \frac{1}{20} \sum_i F'_i(t) [18]
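
Read as X_i(t) = F'_i(t) - Z(t), with Z(t) = 0.9 Z(t-1) + 0.1 M(t) and M(t) the mean of the twenty band components, the normalization can be sketched as follows. The starting value of Z is not given in the text and is an assumption here, as is the function name.

```python
from typing import List, Sequence


def normalize(vectors: Sequence[Sequence[float]], z_start: float = 0.0) -> List[List[float]]:
    """Short term mean normalization of log-domain feature vectors.

    Each input vector has twenty band components followed by a twenty-first
    total-power component, which is discarded. z_start is an assumed Z(t-1)
    for the first vector.
    """
    z = z_start
    normalized = []
    for vec in vectors:
        bands = vec[:20]                 # drop the twenty-first dimension
        m = sum(bands) / len(bands)      # M(t): mean of the band components
        z = 0.9 * z + 0.1 * m            # Z(t): running weighted mean
        normalized.append([f - z for f in bands])
    return normalized
```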



[0065] The normalized twenty dimension feature vector X(t) may be further processed by an adaptive labeler 40 to
adapt to variations in pronunciation of speech sounds. An adapted twenty dimension feature vector X'(t) is generated
by subtracting a twenty dimension adaptation vector A(t) from the twenty dimension feature vector X(t) provided to the
input of the adaptive labeler 40. The adaptation vector A(t) at time t may, for example, be given by the formula

A(t) = A(t-1) + k[X(t-1) - Xp(t-1)] [19]

where k is a fixed parameter of the adaptive labeling model, X(t-1) is the normalized twenty dimension vector input to
the adaptive labeler 40 at time (t-1), Xp(t-1) is the adaptation prototype vector (from adaptation prototype store 36)
closest to the twenty dimension feature vector X(t-1) at time (t-1), and A(t-1) is the adaptation vector at time (t-1).
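
A sketch of the adaptive labeler update A(t) = A(t-1) + k[X(t-1) - Xp(t-1)], followed by X'(t) = X(t) - A(t), is given below. It assumes Euclidean distance for choosing the closest prototype, a zero initial adaptation vector, and a hypothetical default `k`; none of these is stated in the text.

```python
from typing import List, Sequence


def closest(vector: Sequence[float], prototypes: Sequence[Sequence[float]]) -> Sequence[float]:
    """Prototype nearest to the vector (Euclidean distance is an assumption)."""
    return min(prototypes,
               key=lambda p: sum((v - q) ** 2 for v, q in zip(vector, p)))


def adapt(vectors: Sequence[Sequence[float]],
          prototypes: Sequence[Sequence[float]], k: float = 0.1) -> List[List[float]]:
    """Return the adapted vectors X'(t) = X(t) - A(t) for a sequence of X(t)."""
    if not vectors:
        return []
    a = [0.0] * len(vectors[0])   # assumed initial adaptation vector
    prev = None
    adapted = []
    for x in vectors:
        if prev is not None:
            xp = closest(prev, prototypes)
            # A(t) = A(t-1) + k * [X(t-1) - Xp(t-1)]
            a = [ai + k * (pi - qi) for ai, pi, qi in zip(a, prev, xp)]
        adapted.append([xi - ai for xi, ai in zip(x, a)])
        prev = x
    return adapted
```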
[0066] The twenty dimension adapted feature vector signal X'(t) from the adaptive labeler 40 is preferably provided
to an auditory model 42. Auditory model 42 may, for example, provide a model of how the human auditory system
perceives sound signals. An example of an auditory model is described in U.S. Patent 4,980,918 to Bahl et al. entitled
"Speech Recognition System with Efficient Storage and Rapid Assembly of Phonological Graphs".
[0067] Preferably, according to the present invention, for each frequency band i of the adapted feature vector signal
X'(t) at time t, the auditory model 42 calculates a new parameter E_i(t) according to Equations 20 and 21:

E_i(t) = K_1 + K_2 (X'_i(t)) (N_i(t-1)) [20]

where

N_i(t) = K_3 \times N_i(t-1) - E_i(t-1) [21]

and where K_1, K_2, and K_3 are fixed parameters of the auditory model.
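
Taking the two auditory model equations as E_i(t) = K_1 + K_2 X'_i(t) N_i(t-1) and N_i(t) = K_3 N_i(t-1) - E_i(t-1), the recursion can be sketched per frequency band as follows. The parameter defaults and the starting values `n_start` and `e_start` are hypothetical, since the text gives no numbers.

```python
from typing import List, Sequence


def auditory_model(vectors: Sequence[Sequence[float]],
                   k1: float = 1.0, k2: float = 1.0, k3: float = 1.0,
                   n_start: float = 1.0, e_start: float = 0.0) -> List[List[float]]:
    """Return E(t) for each adapted feature vector X'(t).

    n_start and e_start are assumed values of N_i and E_i before the first
    vector; k1, k2 and k3 stand for the fixed parameters K_1, K_2 and K_3.
    """
    if not vectors:
        return []
    dim = len(vectors[0])
    n_prev = [n_start] * dim
    e_prev = [e_start] * dim
    outputs = []
    for x in vectors:
        e = [k1 + k2 * xi * ni for xi, ni in zip(x, n_prev)]   # Equation 20
        n = [k3 * ni - ei for ni, ei in zip(n_prev, e_prev)]   # Equation 21
        outputs.append(e)
        n_prev, e_prev = n, e
    return outputs
```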

[0068] For each centisecond time interval, the output of the auditory model 42 is a modified twenty dimension feature
vector signal. This feature vector is augmented by a twenty-first dimension having a value equal to the square root
of the sum of the squares of the values of the other twenty dimensions.
[0069] For each centisecond time interval, a concatenator 44 preferably concatenates nine twenty-one dimension
feature vectors representing the one current centisecond time interval, the four preceding centisecond time intervals, 
and the four following centisecond time intervals to form a single spliced vector of 189 dimensions. Each 189 dimension 
spliced vector is preferably multiplied in a rotator 46 by a rotation matrix to rotate the spliced vector and to reduce the 
spliced vector to fifty dimensions. 
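
The augmentation, splicing and rotation steps can be sketched together. The function names are hypothetical, the rotation matrix is represented as a list of fifty rows of 189 values, and raising an error near the ends of the utterance is an assumption, since the text does not say how those frames are handled.

```python
import math
from typing import List, Sequence


def augment(vector: Sequence[float]) -> List[float]:
    """Append a twenty-first dimension: the root of the sum of squares."""
    return list(vector) + [math.sqrt(sum(v * v for v in vector))]


def splice(vectors: Sequence[Sequence[float]], t: int, context: int = 4) -> List[float]:
    """Concatenate the vectors for intervals t-4 .. t+4 (nine in total)."""
    if t - context < 0 or t + context >= len(vectors):
        raise IndexError("not enough neighbouring intervals to splice")
    spliced: List[float] = []
    for vec in vectors[t - context:t + context + 1]:
        spliced.extend(vec)
    return spliced


def rotate(spliced: Sequence[float], rotation: Sequence[Sequence[float]]) -> List[float]:
    """Multiply the spliced vector by the rotation matrix, one row per output."""
    return [sum(r * s for r, s in zip(row, spliced)) for row in rotation]
```

Nine vectors of twenty-one dimensions give 9 x 21 = 189 dimensions, and a fifty-row rotation matrix reduces that to fifty.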

[0070] The rotation matrix used in rotator 46 may be obtained, for example, by classifying into M classes a set of 
189 dimension spliced vectors obtained during a training session. The covariance matrix for all of the spliced vectors in
the training set is multiplied by the inverse of the within-class covariance matrix for all of the spliced vectors in all M
classes. The first fifty eigenvectors of the resulting matrix form the rotation matrix. (See, for example, "Vector Quantiza- 
tion Procedure For Speech Recognition Systems Using Discrete Parameter Phoneme-Based Markov Word Models" by 
L. R. Bahl, et al, IBM Technical Disclosure Bulletin, Volume 32, No. 7, December 1989, pages 320 and 321.) 
[0071] Window generator 28, spectrum analyzer 30, adaptive noise cancellation processor 32, short term mean 
normalization processor 38, adaptive labeler 40, auditory model 42, concatenator 44, and rotator 46, may be suitably 
programmed special purpose or general purpose digital signal processors. Prototype stores 34 and 36 may be elec- 
tronic computer memory of the types discussed above. 

[0072] The prototype vectors in prototype store 34 may be obtained, for example, by clustering feature vector sig- 
nals from a training set into a plurality of clusters, and then calculating the mean and standard deviation for each cluster 
to form the parameter values of the prototype vector. When the training script comprises a series of word-segment mod- 
els (forming a model of a series of words), and each word-segment model comprises a series of elementary models 
having specified locations in the word-segment models, the feature vector signals may be clustered by specifying that 
each cluster corresponds to a single elementary model in a single location in a single word-segment model. Such a 
method is described in more detail in U.S. Patent Application Serial No. 730,714, filed on July 16, 1991, entitled "Fast 
Algorithm for Deriving Acoustic Prototypes for Automatic Speech Recognition." 

[0073] Alternatively, all acoustic feature vectors generated by the utterance of a training text and which correspond 
to a given elementary model may be clustered by K-means Euclidean clustering or K-means Gaussian clustering, or 
both. Such a method is described, for example, by Bahl et al in U.S. Patent 5,182,773 entitled "Speaker-Independent 
Label Coding Apparatus". 
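
As an illustration of forming prototype vectors by clustering, the following sketch runs a plain K-means Euclidean clustering and then computes the mean and standard deviation of each cluster. It is not the method of the cited patents: seeding the centres with the first k vectors, the fixed iteration count, and the population form of the standard deviation are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(points: Sequence[Sequence[float]], k: int,
           iterations: int = 10) -> List[List[Sequence[float]]]:
    """Group the points into k clusters by Euclidean K-means."""
    centers = [list(p) for p in points[:k]]  # assumption: seed with first k points
    clusters: List[List[Sequence[float]]] = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return clusters


def prototype(members: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Mean and standard deviation per dimension for one non-empty cluster."""
    n = len(members)
    columns = list(zip(*members))
    mean = [sum(col) / n for col in columns]
    std = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
           for col, m in zip(columns, mean)]
    return mean, std
```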

Claims 

1. A speech recognition apparatus comprising: an acoustic processor (10) for measuring the value of at least one fea- 
ture of each of a sequence of at least two sounds, said acoustic processor (10) measuring the value of the feature 
of each sound during each of a series of successive time intervals to produce a series of feature signals represent- 
ing the feature values of the sound; 

means (12) for storing a set of acoustic command models, each acoustic command model representing one or 
more series of acoustic feature values representing an utterance of a command associated with the acoustic 
command model; 

a match score processor (14) for generating a match score for each sound and each of one or more acoustic 
command models from the set of acoustic command models, each match score comprising an estimate of the 
closeness of a match between the acoustic command model and a series of feature signals corresponding to
the sound;

characterized by 

means (16) for outputting a recognition signal corresponding to the command model having the best match 
score for a current sound if the best match score for the current sound is better than a recognition threshold 
score for the current sound, the recognition threshold for the current sound comprising (a) a first confidence 
score if the best match score for a prior sound was better than a recognition threshold for that prior sound or 
(b) a second confidence score better than the first confidence score if the best match score for a prior sound 
was worse than the recognition threshold for that prior sound. 

2. A speech recognition apparatus as claimed in Claim 1, characterized in that the prior sound occurs immediately 
prior to the current sound. 

3. A speech recognition apparatus as claimed in Claim 2, characterized in that: 

the apparatus further comprises means (20) for storing at least one acoustic silence model representing one
or more series of acoustic feature values representing the absence of a spoken utterance;

the match score processor (14) generates a match score for each sound and the acoustic silence model, each
match score comprising an estimate of the closeness of a match between the acoustic silence model and a 
series of feature signals corresponding to the sound; and 

the recognition threshold for the current sound comprises the first confidence score (a1) if the match score for 
the prior sound and the acoustic silence model is better than a silence match threshold, and if the prior sound 
has a duration exceeding a silence duration threshold, or (a2) if the match score for the prior sound and the 
acoustic silence model is better than the silence match threshold, and if the prior sound has a duration less 
than the silence duration threshold, and if the best match score for the next prior sound and an acoustic com- 
mand model was better than a recognition threshold for that next prior sound, or (a3) if the match score for the 
prior sound and the acoustic silence model is worse than the silence match threshold, and if the best match 
score for the prior sound and an acoustic command model was better than a recognition threshold for that prior 
sound; or 

the recognition threshold for the current sound comprises the second confidence score better than the first 
confidence score (b1) if the match score for the prior sound and the acoustic silence model is better than the 
silence match threshold, and if the prior sound has a duration less than the silence duration threshold, and if 
the best match score for the next prior sound and an acoustic command model was worse than the recognition 
threshold for that next prior sound, or (b2) if the match score for the prior sound and the acoustic silence model 
is worse than the silence match threshold, and if the best match score for the prior sound and an acoustic com- 
mand model was worse than the recognition threshold for that prior sound. 

4. A speech recognition apparatus as claimed in Claim 3, characterized in that the recognition signal comprises a 
command signal for calling a program associated with the command. 

5. A speech recognition apparatus as claimed in Claim 4, characterized in that: 

the output means (16) comprises a display; and 

the output means (16) displays one or more words corresponding to the command model having the best 
match score for a current sound if the best match score for the current sound is better than the recognition 
threshold score for the current sound. 

6. A speech recognition apparatus as claimed in Claim 5, characterized in that the output means (16) outputs an 
unrecognizable-sound indication signal if the best match score for the current sound is worse than the recognition 
threshold score for the current sound. 

7. A speech recognition apparatus as claimed in Claim 6, characterized in that the output means (16) displays an
unrecognizable-sound indicator if the best match score for the current sound is worse than the recognition threshold
score for the current sound.

8. A speech recognition apparatus as claimed in Claim 7, characterized in that the unrecognizable-sound indicator
comprises one or more question marks.

9. A speech recognition apparatus as claimed in Claim 1, characterized in that the acoustic processor (10) comprises
a microphone (24).

10. A speech recognition apparatus as claimed in Claim 1, characterized in that: 

each sound comprises a vocal sound; and 
each command comprises at least one word. 

11. A speech recognition method comprising the steps of: 

measuring the value of at least one feature of each of a sequence of at least two sounds, the value of the feature
of each sound being measured during each of a series of successive time intervals to produce a series of
feature signals representing the feature values of the sound;

storing a set of acoustic command models, each acoustic command model representing one or more series of
acoustic feature values representing an utterance of a command associated with the acoustic command model;

generating a match score for each sound and each of one or more acoustic command models from the set of 
acoustic command models, each match score comprising an estimate of the closeness of a match between the 
acoustic command model and a series of feature signals corresponding to the sound; 
characterized by

outputting a recognition signal corresponding to the command model having the best match score for a current
sound if the best match score for the current sound is better than a recognition threshold score for the current
sound, the recognition threshold for the current sound comprising (a) a first confidence score if the best match
score for a prior sound was better than a recognition threshold for that prior sound, or (b) a second confidence
score better than the first confidence score if the best match score for a prior sound was worse than the
recognition threshold for that prior sound.

12. A speech recognition method as claimed in Claim 11, characterized in that the prior sound occurs immediately prior
to the current sound.

13. A speech recognition method as claimed in Claim 12, further comprising the steps of: 

storing at least one acoustic silence model representing one or more series of acoustic feature values repre- 
senting the absence of a spoken utterance; 

generating a match score for each sound and the acoustic silence model, each match score comprising an 
estimate of the closeness of a match between the acoustic silence model and a series of feature signals cor- 
responding to the sound; and characterized in that 

the recognition threshold for the current sound comprises the first confidence score (a1) if the match score for 
the prior sound and the acoustic silence model is better than a silence match threshold, and if the prior sound 
has a duration exceeding a silence duration threshold, or (a2) if the match score for the prior sound and the 
acoustic silence model is better than the silence match threshold, and if the prior sound has a duration less 
than the silence duration threshold, and if the best match score for the next prior sound and an acoustic com- 
mand model was better than a recognition threshold for that next prior sound, or (a3) if the match score for the 
prior sound and the acoustic silence model is worse than the silence match threshold, and if the best match 
score for the prior sound and an acoustic command model was better than a recognition threshold for that prior 
sound; or the recognition threshold for the current sound comprises the second confidence score better than 
the first confidence score (b1) if the match score for the prior sound and the acoustic silence model is better 
than the silence match threshold, and if the prior sound has a duration less than the silence duration threshold, 
and if the best match score for the next prior sound and an acoustic command model was worse than the rec- 
ognition threshold for that next prior sound, or (b2) if the match score for the prior sound and the acoustic 
silence model is worse than the silence match threshold, and if the best match score for the prior sound and 
an acoustic command model was worse than the recognition threshold for that prior sound.

14. A speech recognition method as claimed in Claim 13, characterized in that the recognition signal comprises a com- 
mand signal for calling a program associated with the command. 

15. A speech recognition method as claimed in Claim 14, further comprising the step of displaying one or more words
corresponding to the command model having the best match score for a current sound if the best match score for 
the current sound is better than the recognition threshold score for the current sound. 

16. A speech recognition method as claimed in Claim 15, further comprising the step of outputting an
unrecognizable-sound indication signal if the best match score for the current sound is worse than the recognition
threshold score for the current sound.

17. A speech recognition method as claimed in Claim 16, further comprising the step of displaying an
unrecognizable-sound indicator if the best match score for the current sound is worse than the recognition threshold
score for the current sound.

18. A speech recognition method as claimed in Claim 17, characterized in that the unrecognizable-sound indicator
comprises one or more question marks.

19. A speech recognition method as claimed in Claim 11, characterized in that: 

each sound comprises a vocal sound; and 
each command comprises at least one word. 

Claims (translation of the German Patentansprüche)

1. A speech recognition apparatus comprising:

an acoustic processor (10) for measuring the value of at least one feature of each of a sequence of at least
two sounds, the acoustic processor (10) measuring the value of the feature of each sound during each of a
series of successive time intervals in order to produce a series of feature signals representing the feature
values of the sound;

means (12) for storing a set of acoustic command models, each acoustic command model representing one or
more series of acoustic feature values that represent an utterance of a command associated with the acoustic
command model;

a match score processor (14) for generating a match score for each sound and each of one or more acoustic
command models from the set of acoustic command models, each match score comprising an estimate of the
closeness of a match between the acoustic command model and a series of feature signals corresponding to
the sound;
characterized by:

means (16) for outputting a recognition signal corresponding to the command model having the best match
score for a current sound if the best match score for the current sound is better than a recognition threshold
score for the current sound, the recognition threshold for the current sound comprising: (a) a first confidence
score if the best match score for a prior sound was better than a recognition threshold for that prior sound, or
(b) a second confidence score, better than the first confidence score, if the best match score for a prior sound
was worse than the recognition threshold for that prior sound.

2. A speech recognition apparatus according to claim 1, characterized in that the prior sound occurs immediately
before the current sound.

3. A speech recognition apparatus according to claim 2, characterized in that:

the apparatus further comprises means (20) for storing at least one acoustic silence model representing one
or more series of acoustic feature values that represent the absence of a spoken utterance;

the match score processor (14) generates a match score for each sound and the acoustic silence model, each
match score comprising an estimate of the closeness of a match between the acoustic silence model and a
series of feature signals corresponding to the sound; and

the recognition threshold for the current sound comprises the first confidence score (a1) if the match score for
the prior sound and the acoustic silence model is better than a silence match threshold and if the prior sound
has a duration exceeding a silence duration threshold, or (a2) if the match score for the prior sound and the
acoustic silence model is better than the silence match threshold and if the prior sound has a duration shorter
than the silence duration threshold and if the best match score for the next prior sound and an acoustic
command model was better than a recognition threshold for that next prior sound, or (a3) if the match score for
the prior sound and the acoustic silence model is worse than the silence match threshold and if the best match
score for the prior sound and an acoustic command model was better than a recognition threshold for that prior
sound; or

the recognition threshold for the current sound comprises the second confidence score, which is better than the
first confidence score, (b1) if the match score for the prior sound and the acoustic silence model is better than
the silence match threshold and if the prior sound has a duration shorter than the silence duration threshold,
and if the best match score for the next prior sound and an acoustic command model was worse than the
recognition threshold for that next prior sound, or (b2) if the match score for the prior sound and the acoustic
silence model is worse than the silence match threshold and if the best match score for the prior sound and an
acoustic command model was worse than the recognition threshold for that prior sound.

4. A speech recognition apparatus according to claim 3, characterized in that the recognition signal comprises a
command signal for calling a program associated with the command.

5. A speech recognition apparatus according to claim 4, characterized in that:

the output means (16) comprises a display; and

the output means (16) displays one or more words corresponding to the command model having the best match
score for a current sound if the best match score for the current sound is better than the recognition threshold
score for the current sound.

6. A speech recognition apparatus according to claim 5, characterized in that the output means (16) outputs an
indication signal for an unrecognizable sound if the best match score for the current sound is worse than the
recognition threshold score for the current sound.

7. A speech recognition apparatus according to claim 6, characterized in that the output means (16) displays an
indicator for an unrecognizable sound if the best match score for the current sound is worse than the
recognition threshold score for the current sound.

8. A speech recognition apparatus according to claim 7, characterized in that the indicator for an unrecognizable
sound comprises one or more question marks.

9. A speech recognition apparatus according to claim 1, characterized in that the acoustic processor (10)
comprises a microphone (24).

10. A speech recognition apparatus according to claim 1, characterized in that:
each sound comprises a vocal sound; and
each command comprises at least one word.

11. A speech recognition method comprising the following steps:

measuring the value of at least one feature of each of a sequence of at least two sounds, the value of the
feature of each sound being measured during each of a series of successive time intervals in order to produce
a series of feature signals representing the feature values of the sound;

storing a set of acoustic command models, each acoustic command model representing one or more series of
acoustic feature values that represent an utterance of a command associated with the acoustic command model;

generating a match score for each sound and each of one or more acoustic command models from the set of
acoustic command models, each match score comprising an estimate of the closeness of a match between the
acoustic command model and a series of feature signals corresponding to the sound;
characterized by

outputting a recognition signal corresponding to the command model having the best match score for a current
sound if the best match score for the current sound is better than a recognition threshold score for the current
sound, the recognition threshold for the current sound comprising: (a) a first confidence score if the best match
score for a prior sound was better than a recognition threshold for that prior sound, or (b) a second confidence
score, better than the first confidence score, if the best match score for a prior sound was worse than the
recognition threshold for that prior sound.

12. A speech recognition method according to claim 11, characterized in that the prior sound occurs immediately
before the current sound.

13. A speech recognition method according to claim 12, further comprising the following steps:

storing at least one acoustic silence model representing one or more series of acoustic feature values that
represent the absence of a spoken utterance;

generating a match score for each sound and the acoustic silence model, each match score comprising an
estimate of the closeness of a match between the acoustic silence model and a series of feature signals
corresponding to the sound; and characterized in that

the recognition threshold for the current sound comprises the first confidence score (a1) if the match score for
the prior sound and the acoustic silence model is better than a silence match threshold and if the prior sound
has a duration exceeding a silence duration threshold, or (a2) if the match score for the prior sound and the
acoustic silence model is better than the silence match threshold and if the prior sound has a duration shorter
than the silence duration threshold and if the best match score for the next prior sound and an acoustic
command model was better than a recognition threshold for that next prior sound, or (a3) if the match score for
the prior sound and the acoustic silence model is worse than the silence match threshold and if the best match
score for the prior sound and an acoustic command model was better than a recognition threshold for that prior
sound; or the recognition threshold for the current sound comprises the second confidence score, which is
better than the first confidence score, (b1) if the match score for the prior sound and the acoustic silence model
is better than the silence match threshold and if the prior sound has a duration shorter than the silence duration
threshold, and if the best match score for the next prior sound and an acoustic command model was worse than
the recognition threshold for that next prior sound, or (b2) if the match score for the prior sound and the acoustic
silence model is worse than the silence match threshold and if the best match score for the prior sound and an
acoustic command model was worse than the recognition threshold for that prior sound.

14. A speech recognition method according to claim 13, characterized in that the recognition signal comprises a
command signal for calling a program associated with the command.

15. A speech recognition method according to claim 14, further comprising the step of displaying one or more
words corresponding to the command model having the best match score for a current sound if the best match
score for the current sound is better than the recognition threshold score for the current sound.

16. A speech recognition method according to claim 15, further comprising the step of outputting an indication
signal for an unrecognizable sound if the best match score for the current sound is worse than the recognition
threshold score for the current sound.

17. A speech recognition method according to claim 16, further comprising the step of displaying an indicator for
an unrecognizable sound if the best match score for the current sound is worse than the recognition threshold
score for the current sound.

18. A speech recognition method according to claim 17, characterized in that the indicator for an unrecognizable
sound comprises one or more question marks.

19. A speech recognition method according to claim 11, characterized in that

each sound comprises a vocal sound; and

each command comprises at least one word.
5 Revendications 

1. A speech recognition apparatus comprising:

an acoustic processor (10) for measuring the value of at least one feature of each sound of a sequence of
at least two sounds, said acoustic processor (10) measuring the value of the feature of each sound during
each of a series of successive time intervals to produce a series of feature signals representing the
feature values of the sound;

means (12) for storing a set of acoustic command models, each acoustic command model representing one or
more series of acoustic feature values representing an utterance of a command associated with the acoustic
command model;

a match score processor (14) for generating a match score for each sound and each of one or more acoustic
command models from the set of acoustic command models, each match score comprising an estimate of the
closeness of a match between the acoustic command model and a series of feature signals corresponding to
the sound;

characterized by

means (16) for outputting a recognition signal corresponding to the command model having the best match
score for a current sound if the best match score for the current sound is better than a recognition
threshold score for the current sound, the recognition threshold for the current sound comprising (a) a
first confidence score if the best match score for a prior sound was better than a recognition threshold
for that prior sound, or (b) a second confidence score better than the first confidence score if the best
match score for a prior sound was worse than the recognition threshold for that prior sound.

2. The speech recognition apparatus according to claim 1, characterized in that the prior sound immediately
precedes the current sound.

3. The speech recognition apparatus according to claim 2, characterized in that:

the apparatus further comprises means (20) for storing at least one acoustic silence model representing one
or more series of acoustic feature values representing the absence of a spoken utterance;

the match score processor (14) generates a match score for each sound and the acoustic silence model, each
match score comprising an estimate of the closeness of a match between the acoustic silence model and a
series of feature signals corresponding to the sound; and

the recognition threshold for the current sound comprises the first confidence score (a1) if the match
score for the prior sound and the acoustic silence model is better than a silence match threshold, and if
the prior sound has a duration exceeding a silence duration threshold, or (a2) if the match score for the
prior sound and the acoustic silence model is better than the silence match threshold, and if the prior
sound has a duration less than the silence duration threshold, and if the best match score for the next
prior sound and an acoustic command model was better than a recognition threshold for that next prior
sound, or (a3) if the match score for the prior sound and the acoustic silence model was worse than the
silence match threshold, and if the best match score for the prior sound and an acoustic command model was
better than a recognition threshold for that prior sound; or

the recognition threshold for the current sound comprises the second confidence score, better than the
first confidence score, (b1) if the match score for the prior sound and the acoustic silence model is
better than the silence match threshold, and if the prior sound has a duration less than the silence
duration threshold, and if the best match score for the next prior sound and an acoustic command model was
worse than the recognition threshold for that next prior sound, or (b2) if the match score for the prior
sound and the acoustic silence model was worse than the silence match threshold, and if the best match
score for the prior sound and an acoustic command model was worse than the recognition threshold for that
prior sound.

4. The speech recognition apparatus according to claim 3, characterized in that the recognition signal
comprises a command signal for calling a program associated with the command.

5. The speech recognition apparatus according to claim 4, characterized in that:

the output means (16) comprises a display; and

the output means (16) displays one or more words corresponding to the command model having the best match
score for a current sound if the best match score for the current sound is better than the recognition
threshold score for the current sound.

6. The speech recognition apparatus according to claim 5, characterized in that the output means (16)
outputs an unrecognizable-sound indication signal if the best match score for the current sound is worse
than the recognition threshold score for the current sound.

7. The speech recognition apparatus according to claim 6, characterized in that the output means (16)
displays an unrecognizable-sound indicator if the best match score for the current sound is worse than the
recognition threshold score for the current sound.

8. The speech recognition apparatus according to claim 7, characterized in that the unrecognizable-sound
indicator comprises one or more question marks.

9. The speech recognition apparatus according to claim 1, characterized in that the acoustic processor (10)
comprises a microphone (24).

10. The speech recognition apparatus according to claim 1, characterized in that:

each sound comprises a vocal sound; and
each command comprises at least one word.
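
The threshold selection recited in claims 1 to 3 amounts to a small decision procedure, sketched below in Python. This is a minimal sketch under assumptions and not the patented implementation: the names `PriorSound` and `select_threshold`, the use of seconds for durations, and the convention that numerically higher scores are "better" are hypothetical. The claims do not say what happens when a value exactly equals a threshold; the sketch treats equality as "not better" and "not exceeding".

```python
from dataclasses import dataclass


@dataclass
class PriorSound:
    silence_score: float       # match score against the acoustic silence model
    duration: float            # duration of the sound (assumed unit: seconds)
    best_command_score: float  # best match score against any command model
    threshold: float           # recognition threshold that applied to this sound

    @property
    def was_recognized(self) -> bool:
        # Assumption: "better than" means numerically greater than.
        return self.best_command_score > self.threshold


def select_threshold(prior: PriorSound, next_prior: PriorSound,
                     first_conf: float, second_conf: float,
                     silence_match_threshold: float,
                     silence_duration_threshold: float) -> float:
    """Pick the recognition threshold for the current sound.

    `prior` immediately precedes the current sound (claim 2); `next_prior`
    precedes `prior`. `second_conf` is the stricter ("better") score.
    """
    prior_is_silence = prior.silence_score > silence_match_threshold
    if prior_is_silence:
        if prior.duration > silence_duration_threshold:
            return first_conf  # (a1): a long silence came before
        # Short silence: look through it to the sound before the silence.
        if next_prior.was_recognized:
            return first_conf  # (a2)
        return second_conf     # (b1)
    # The prior sound was not silence: use its own recognition outcome.
    if prior.was_recognized:
        return first_conf      # (a3)
    return second_conf         # (b2)
```

The effect is that a sound following a rejected sound, or following only a brief pause after a rejected sound, must clear the stricter second confidence score, while a sound following a long silence or a recognized command only needs to clear the first.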

11. A speech recognition method comprising the steps of:

measuring the value of at least one feature of each sound of a sequence of at least two sounds, the value
of the feature of each sound being measured during each of a series of successive time intervals to
produce a series of feature signals representing the feature values of the sound;

storing a set of acoustic command models, each acoustic command model representing one or more series of
acoustic feature values representing an utterance of a command associated with the acoustic command model;

generating a match score for each sound and each of one or more acoustic command models from the set of
acoustic command models, each match score comprising an estimate of the closeness of a match between the
acoustic command model and a series of feature signals corresponding to the sound;

characterized by

outputting a recognition signal corresponding to the command model having the best match score for a
current sound if the best match score for the current sound is better than the recognition threshold score
for the current sound, the recognition threshold for the current sound comprising (a) a first confidence
score if the best match score for a prior sound was better than a recognition threshold for that prior
sound, or (b) a second confidence score better than the first confidence score if the best match score for
a prior sound was worse than the recognition threshold for that prior sound.

12. The speech recognition method according to claim 11, characterized in that the prior sound immediately
precedes the current sound.

13. The speech recognition method according to claim 12, further comprising the steps of:

storing at least one acoustic silence model representing one or more series of acoustic feature values
representing the absence of a spoken utterance;

generating a match score for each sound and the acoustic silence model, each match score comprising an
estimate of the closeness of a match between the acoustic silence model and a series of feature signals
corresponding to the sound; and characterized in that

the recognition threshold for the current sound comprises the first confidence score (a1) if the match
score for the prior sound and the acoustic silence model is better than a silence match threshold, and if
the prior sound has a duration exceeding a silence duration threshold, or (a2) if the match score for the
prior sound and the acoustic silence model is better than the silence match threshold, and if the prior
sound has a duration less than the silence duration threshold, and if the best match score for the next
prior sound and an acoustic command model was better than a recognition threshold for that next prior
sound, or (a3) if the match score for the prior sound and the acoustic silence model was worse than the
silence match threshold, and if the best match score for the prior sound and an acoustic command model was
better than a recognition threshold for that prior sound; or

the recognition threshold for the current sound comprises the second confidence score, better than the
first confidence score, (b1) if the match score for the prior sound and the acoustic silence model is
better than the silence match threshold, and if the prior sound has a duration less than the silence
duration threshold, and if the best match score for the next prior sound and an acoustic command model was
worse than the recognition threshold for that next prior sound, or (b2) if the match score for the prior
sound and the acoustic silence model was worse than the silence match threshold, and if the best match
score for the prior sound and an acoustic command model was worse than the recognition threshold for that
prior sound.

14. The speech recognition method according to claim 13, characterized in that the recognition signal
comprises a command signal for calling a program associated with the command.

15. The speech recognition method according to claim 14, further comprising the step of displaying one or
more words corresponding to the command model having the best match score for a current sound if the best
match score for the current sound is better than the recognition threshold score for the current sound.

16. The speech recognition method according to claim 15, further comprising the step of outputting an
unrecognizable-sound indication signal if the best match score for the current sound is worse than the
recognition threshold score for the current sound.

17. The speech recognition method according to claim 16, further comprising the step of displaying an
unrecognizable-sound indicator if the best match score for the current sound is worse than the recognition
threshold score for the current sound.

18. The speech recognition method according to claim 17, characterized in that the unrecognizable-sound
indicator comprises one or more question marks.

19. The speech recognition method according to claim 11, characterized in that:

each sound comprises a vocal sound; and
each command comprises at least one word.
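
The method of claims 11 and 12 can be followed end to end on toy data. The Python sketch below is illustrative only. The scoring function (negative mean absolute difference between feature values), the names `match_score` and `recognize_sequence`, the numeric thresholds, and the choice to apply the first confidence score to the first sound are assumptions made for the example; the claims do not prescribe any of them.

```python
def match_score(features, model):
    """Toy closeness estimate: negative mean absolute difference.

    Assumption: higher (closer to zero) is better. A real recognizer would
    use a statistical acoustic model rather than this distance.
    """
    n = min(len(features), len(model))
    return -sum(abs(f - m) for f, m in zip(features, model)) / n


def recognize_sequence(sounds, command_models, first_conf, second_conf):
    """Return the recognized command name, or None, for each sound in order."""
    results = []
    # Assumption: the first sound has no prior sound, so use the first score.
    threshold = first_conf
    for features in sounds:
        # Score the sound against every stored acoustic command model.
        scores = {name: match_score(features, model)
                  for name, model in command_models.items()}
        best = max(scores, key=scores.get)
        recognized = scores[best] > threshold
        results.append(best if recognized else None)
        # The outcome for this sound sets the threshold for the next one:
        # stricter (second_conf) after a rejected sound.
        threshold = first_conf if recognized else second_conf
    return results


models = {"open": [1.0, 2.0, 3.0], "close": [3.0, 2.0, 1.0]}
sounds = [
    [1.1, 2.0, 2.9],  # clear "open"
    [2.0, 2.0, 2.0],  # ambiguous noise, e.g. a cough
    [1.4, 2.0, 2.6],  # marginal "open" right after the noise
    [1.0, 2.0, 3.0],  # exact "open"
]
print(recognize_sequence(sounds, models, first_conf=-0.5, second_conf=-0.2))
```

With these toy values the script prints `['open', None, None, 'open']`. The third sound would pass the first confidence score of -0.5 on its own, but it is rejected because the preceding sound was rejected and the stricter score of -0.2 applies.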